Closes #1570
This PR changes how volume replicas are moved off nodes that are being deleted. Previously, the number of replicas was increased and then decreased again, which forced a replica to move to another node while maintaining the replica count specified in the StorageClass. This PR instead uses the Longhorn setting `block-for-eviction-if-last-replica`, which has the following benefits. With this setting, Longhorn tries to maintain the replica count defined in the StorageClass, provided the required number of nodes exists. On node deletion, the setting blocks the deletion if the node being deleted is the last node that has a healthy replica; once a new replica has been created on another node, the deletion continues as expected.
Testing was done on both dynamic and static nodes: I deleted entire nodepools that held all of the replicas, forcing the replicas to move to another available nodepool. From time to time the eviction of the replicas stalled for roughly 10-15 minutes, but it always resumed and completed.
Further, I had to disable concurrent cordoning of nodes, because I encountered an issue where node deletion would deadlock when all of the nodes holding the replicas were to be deleted. Nodes are now cordoned and deleted one by one instead.
Additionally, another bug was spotted where the StorageClasses for providers defined in the InputManifest were not cleaned up correctly. The Longhorn annotations were also moved to the `PatchAnnotations` part of the Kuber microservice, so that there is a single point where annotations are applied.